---
title: ArxivLoader
---

[arXiv](https://arxiv.org/) is an open-access archive for 2 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.

## Setup

To access Arxiv document loader you'll need to install the `arxiv`, `PyMuPDF` and `langchain-community` integration packages. PyMuPDF transforms PDF files downloaded from the arxiv.org site into the text format.

```python
%pip install -qU langchain-community arxiv pymupdf
```

## Instantiation

Now we can instantiate our model object and load documents:

```python
from langchain_community.document_loaders import ArxivLoader

# Supports all arguments of `ArxivAPIWrapper`
loader = ArxivLoader(
    query="reasoning",
    load_max_docs=2,
    # doc_content_chars_max=1000,
    # load_all_available_meta=False,
    # ...
)
```

## Load

Use `.load()` to synchronously load into memory all Documents, with one
Document per one arxiv paper.

Let's run through a basic example of how to use the `ArxivLoader` searching for papers of reasoning:

```python
docs = loader.load()
docs[0]
```

```output
Document(page_content='Hypothesis Testing Prompting Improves Deductive Reasoning in\nLarge Language Models\nYitian Li1,2, Jidong Tian1,2, Hao He1,2, Yaohui Jin1,2\n1MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University\n2State Key Lab of Advanced Optical Communication System and Network\n{yitian_li, frank92, hehao, jinyh}@sjtu.edu.cn\nAbstract\nCombining different forms of prompts with pre-trained large language models has yielded remarkable results on\nreasoning tasks (e.g. Chain-of-Thought prompting). However, along with testing on more complex reasoning, these\nmethods also expose problems such as invalid reasoning and fictional reasoning paths. In this paper, we develop\nHypothesis Testing Prompting, which adds conclusion assumptions, backward reasoning, and fact verification during\nintermediate reasoning steps. Hypothesis Testing prompting involves multiple assumptions and reverses validation of\nconclusions leading to its unique correct answer. Experiments on two challenging deductive reasoning datasets\nProofWriter and RuleTaker show that hypothesis testing prompting not only significantly improves the effect, but also\ngenerates a more reasonable and standardized reasoning process.\nKeywords: Deductive Reasoning, Large Language Models, Prompt\n1.\nIntroduction\nThe release of large language models (LLMs) has\nrevolutionized the NLP landscape recently (Thop-\npilan et al., 2022; Kaplan et al., 2020; Chowdh-\nery et al., 2022). Scaling up the size of language\nmodels and conducting diversified prompt meth-\nods become mainstream (Liu et al., 2023c; Wei\net al., 2022a; Yang et al., 2023). Given In-context\nlearning or Chain-of-Thought prompts have already\nachieved high performance on challenging tasks\nsuch as commonsense, arithmetic, and symbolic\nreasoning (Imani et al., 2023; Lee et al., 2021;\nKojima et al., 2022). Logical reasoning is one of\nthe most important and long-standing problems in\nNLP (Hirschberg and Manning, 2015; Russell and\nNorvig, 2010), and integrating this ability into nat-\nural language understanding systems has always\nbeen a goal pursued (Du et al., 2022).\nNevertheless, scaling has been demonstrated\nto offer limited advantages in resolving complex\nlogical reasoning issues (Kazemi et al., 2022). For\nexample, Saparov and He (2022) show that Chain-\nof-Thought prompting struggles with proof planning\nfor more complex logical reasoning problems. Addi-\ntionally, the performance suffers greatly while han-\ndling recently released and out-of-distribution logi-\ncal reasoning datasets (Liu et al., 2023a). Despite\nmany works have explored variants of Chain-of-\nThought prompts to facilitate LLMs inference (Zelik-\nman et al., 2022; Zheng et al., 2023), we discover\nthat the present logical reasoning task prompts\nplace an excessive amount of emphasis on the\nreasoning process while ignoring the origin, pur-\npose, and effectiveness of reasoning (Creswell\net al., 2022; Xi et al., 2023). As examples shown in\nQ1: Bob is green. True/false? \nInput Facts: Alan is blue. Alan is \nrough. Alan is young. Bob is big. \nBob is round. Charlie is big. Charlie \nis blue. Charlie is green. Dave is \ngreen. Dave is rough.\nInput Rules: Big people are rough. \nIf someone is young and round then \nthey are kind. If someone is round \nand big then they are blue. All\nrough people are green.\nBob is big. \nBig people are rough. \nAll rough people are green.\nAnswer: T\nQ2: Dave is blue. True/false?\nLess inspiring\nFigure 1: Questions in RuleTaker involve logical\nreasoning with facts and rules.\nFigure 1, the difficulty in judging logical problems\narises not only from the process of reasoning but\nalso from the choice of facts and rules to use as a\nstarting point. Even if we were provided the thought\nprocess for some of the issues, it would not be very\nbeneficial for others, based on how we previously\ncreated the prompts.\nIn this paper, we propose Hypothesis Testing\nPrompting, a new and more considerate prompt\ntemplate design idea. Hypothesis testing is a for-\nmal procedure for investigating our ideas about\nthe world using statistics and is often used by sci-\nentists to test specific predictions (Bevans, 2022).\nWe draw inspiration from its process to introduce a\nprocess of conclusion assumptions, backward rea-\nsoning, and fact verification. Experiments on Rule-\nTaker (Clark et al., 2020) and ProofWriter (Tafjord\net al., 2021) show the effectiveness of our novel\nprompting paradigm as a strategy for promoting\ndeductive reasoning in large language models. Fur-\nther analyses show that Hypothesis Test prompting\ngenerates more desirable intermediate processes\narXiv:2405.06707v1  [cs.CL]  9 May 2024\nand significantly improves the "Unknown" label.\n2.\nRelated Work\n2.1.\nFew-Shot Prompting\nBrown et al. (2020) propose in-context learning as\nan alternative few-shot prompting way to stimulate\nability. Besides, chain-of-Thought (CoT) (Wei et al.,\n2022b) is one of the most well-known works, which\ndecomposes the problem into intermediate steps\nand further improves the ability of large language\nmodels. Subsequently, several follow-up works\nwere carried out, including Zero-shot-CoT (simply\nadding "Let’s think step by step" before each an-\nswer) (Kojima et al., 2022), Self-consistency (Wang\net al., 2022), complexity-based (Fu et al., 2022),\nand other prompting work (Liu et al., 2023b; Jung\net al., 2022; Zhou et al., 2022; Saparov and He,\n2022). While these methods enhance the perfor-\nmance of inference by paying attention to indica-\ntions of the reasoning process, they often overlook\nsome aspects such as identifying the root cause of\nthe problem, establishing efficient reasoning strate-\ngies, and determining the direction of logical rea-\nsoning.\n2.2.\nDeductive Reasoning\nDeductive reasoning is defined as the applica-\ntion of general concepts to particular circum-\nstances (Johnson-Laird, 2010). Making logical as-\nsumptions is the foundation of deductive reasoning,\nwhich then bases a conclusion on those assump-\ntions. The deduction task is then applied to a sit-\nuation from the actual world after starting with a\nrule. In light of the principles "All men are mortal."\nand "Socrates is a man." for example, we can draw\nthe conclusion that "Socrates is mortal." (Johnson-\nLaird, 1999).\n3.\nHypothesis Testing Prompting\nHypothesis testing is a formal procedure for investi-\ngating our ideas about the world using statistics and\nused by scientists to test specific predictions that\narise from theories (Bevans, 2022; La et al., 2012).\nThere are 5 main steps in hypothesis testing:\n1. State your research hypothesis;\n2. Collect data in a way designed to test the hy-\npothesis;\n3. Perform an appropriate statistical test;\n4. Decide whether to reject or fail to reject your\nnull hypothesis;\n5. Present the findings in your results and discus-\nsion section;\nWhen completing a challenging reasoning activ-\nity, such as a multi-step deductive reasoning prob-\nlem, one is not conducting random reasoning to\nobtain all possible intermediate results. We shall\nchoose the relevant conditions for inference ver-\nification after initially making assumptions about\nthe judgment problem, such as " First assume the\nconclusion is True and start from ... Then assume\nthe conclusion is False and start from ... because\nthe rules state that ... So the conclusion ...". The\npurpose of this study is to give language models\nthe capacity to build a process that is similar to\nwhat we defined as Hypothesis Testing Prompt-\ning. We will show that large language models can\ngenerate more appropriate thought and more ac-\ncurate results if demonstrations of hypothesis test\nprompting are provided in the exemplars for few-\nshot prompting. Figure 2 shows an example of a\nmodel producing a hypothesis testing thought to\nsolve a deductive reasoning problem.\n4.\nExperiment\n4.1.\nExperimental Setup\nWe explore Hypothesis Test Prompting for Chat-\nGPT (GPT-3.5-Turbo in the OpenAI API) on multiple\nlogical reasoning benchmarks.\nBenchmarks. Considering FOL reasoning in\nquestion answering systems, there are two world\nassumptions (Reiter, 1981) that result in different\nobjectives. One is the closed world assumption\n(CWA), which is the presumption that what is not\ncurrently known to be entailment is contradiction.\nThe other is the open world assumption (OWA),\nwhose objective should distinguish false proposi-\ntions from uncertain ones. Due to differences in\nworld assumptions, our analysis and solutions are\nalso different.\nWe consider the following two deductive reason-\ning problem benchmarks: (1) the RuleTaker (Clark\net al., 2020) benchmark using CWA assumption;\n(2) the ProofWriter (Tafjord et al., 2021) benchmark\nusing OWA assumption. Both datasets are divided\ninto five parts, each part requiring 0, ≤1, ≤2, ≤\n3, and ≤5 hops of reasoning, respectively. We\nconducted comparison tests on the test set of the\ntwo datasets for 5 distinct hops.\nStandard prompting. As one of the baselines,\nwe take into account the common few-shot prompt-\ning, made popular by Brown et al. (2020), in which\na language model is provided with in-context ex-\namples of input-output pairings before producing a\nprediction for a test-time example. Examples are\npresented in the form of questions and answers.\nAs seen in Figure 2(above), the model directly an-\nswers the question.\nChain-of-Thought prompting. We also com-\nChain-of-Thought Prompting\nModel Input\nStandard Prompting\nModel Input\nQ: Judge the following conclusion \'Harry is cold.\' is correct, wrong, or unknown, based on the following texts: Text 1\nA: The conclusion \'Harry is cold.\' is True.\nQ: Judge the following conclusion ‘Bob is not young.’ is correct, wrong, or unknown, based on the following texts: Text 2\nModel Output\nText 1: Charlie is rough. Charlie is white. Erin is blue. Erin is rough. \nErin is young. Gary is blue. Gary is green. Gary is quiet. Gary is \nwhite. Harry is blue. Harry is rough. All rough, white things are \ngreen. All young, blue things are quiet. If something is young then it \nis blue. White, blue things are quiet. If something is green then it is \nyoung. Quiet, green things are cold. All green, cold things are rough.\nQuestion: Harry is cold.     \nAnswer: True\nText 2: Bob is rough. Fiona is not furry. Fiona is white. Young, nice \nthings are big. Blue things are big. If something is young and white \nthen it is nice. All nice things are young. All white things are young. \nIf something is blue and not big then it is not rough.\nQuestion: Bob is not young. \nAnswer: Unknown\nQ: Judge the following conclusion \'Harry is cold.\' is correct, wrong, or unknown, based on the following texts: Text 1\nA: Start from \'Harry is quiet\' and \'Quiet things are nice\'. Then can know \'Harry is nice\'. And \'All nice things are cold\' is given. Finally \nit can be deduced \'Harry is cold\'. So the conclusion \'Harry is cold.\' is True.\nQ: Judge the following conclusion \'Harry is not quiet.\' is correct, wrong, or unknown, based on the following texts: Text  2\nModel Output\nHypothesis Testing Prompting\nModel Input\nQ: Judge the following conclusion \'Harry is cold.\' is correct, wrong, or unknown, based on the following texts: Text 1\nA: First assume the conclusion is True and start from Harry is cold. Because the rules state that all nice things are cold, correct\nconclusion needs to be satisfied that Harry is nice. And the rules states that quiet things are nice. It needs to be satisfied Harry is quiet, \nwhich is also stated in the facts. So the conclusion \'Harry is cold.\' is True.\nQ: Judge the following conclusion ‘Bob is not young.’ is correct, wrong, or unknown, based on the following texts: Text  2\nQ: The conclusion \'Bob is not young.\' is True.\nQ: Start from \'Bob is rough\' and \'If something is blue and not big then it is not rough\'. Then can know \'Bob is not blue\'. And \'All white \nthings are young\' is given. Finally it can be deduced \'Bob is not young\'. So the conclusion \'Bob is not young.\' is True.\nModel Output\nQ: First assume the conclusion is True and start from Bob is not young. Because the rules state that all white things are young, correct \nconclusion needs to be satisfied that Bob is not white. And there are no facts or rules about \'not white\'. Then assume the conclusion is \nFalse and start from Bob is young. Because the rules state that all nice things are young, correct conclusion needs to be satisfied that \nBob is nice. And there are no facts or rules about \'nice\'. So the conclusion \'Bob is not young.\' is Unknown.\n× N\n× N\n× N\nFigure 2: Comparison of three prompting methods: (a) Standard (b) Chain-of-Thought (c) Hypothesis\nTesting. Particularly, we highlight the Hypothesis testing reasoning processes. The comparative experi-\nmental results show that: Hypothesis testing prompting enables large language models to tackle complex\nlogical reasoning.\npare with Chain-of-thought prompting which has\nachieved encouraging results on complex reason-\ning tasks (Wei et al., 2022b).\nAs seen in Fig-\nure 2(middle), the model not only provides the final\nanswer but also comes with the consideration of\nintermediate steps.\nHypothesis Testing Prompting. Our proposed\napproach is to augment each exemplars in few-shot\nprompting with the thought of hypothesis testing for\nan associated answer, as illustrated in Figure 2(be-\nlow). We show one chain of thought exemplars\n(Example: Judge the following conclusion ’<Con-\nclusion>’ is true, false, or unknown, based on the\nfollowing facts and rules: <Facts> ... <Rules> ...).\n4.2.\nExperimental Results\nThe results for Hypothesis Testing Prompting and\nthe baselines on the RuleTaker datasets are pro-\nvided in Figure 3(a), and ProofWriter results are\nshown in Figure 3(b). From the results, we ob-\nserve that our method significantly outperforms the\nother two baselines, especially on ProofWriter. Fig-\nure 3(a) demonstrates that while CoT performs well\nin the low hop, Hypothesis Testing prompting per-\nforms better as the hops count increases on Rule-\nTaker. While on ProofWriter, our approach has a\nthorough lead (improved accuracy by over 4% on\nall hops). Comparing two datasets, the latter dis-\ntinguishes between "False" and "Unknown", which\ndemand a greater level of logic. The results on two\n0.97\n0.93\n0.83\n0.84\n0.81\n0.99\n0.92\n0.77\n0.8\n0.78\n0.78\n0.72\n0.63\n0.63\n0.65\n0\n0.1\n0.2\n0.3\n0.4\n0.5\n0.6\n0.7\n0.8\n0.9\n1\ndepth-0\ndepth-1\ndepth-2\ndepth-3\ndepth-5\nAccuracy\nStandard Prompting\nChain-of-Thought Prompting\nHypothesis Testing Prompting\n(a)\n0.6\n0.41\n0.39\n0.34\n0.33\n0.77\n0.54\n0.58\n0.56\n0.5\n0.82\n0.63\n0.62\n0.61\n0.57\n0\n0.1\n0.2\n0.3\n0.4\n0.5\n0.6\n0.7\n0.8\n0.9\n1\ndepth-0\ndepth-1\ndepth-2\ndepth-3\ndepth-5\nAccuracy\nStandard Prompting\nChain-of-Thought Prompting\nHypothesis Testing Prompting\n(b)\nFigure 3: Prediction accuracy results on the (a) RuleTaker and (b) ProofWriter datasets.\n0.74\n0.26\n0\n0.2\n0.4\n0.6\n0.8\n1\nHypothesis Testing\nChain-of-Thought\n(a) The proof accuracy of Chain-of-Thought and Hy-\npothesis Testing prompting.\n0.65\n0.3\n0\n0.2\n0.4\n0.6\n0.8\n1\nHypothesis Testing\nChain-of-Thought\n(b) Comparison of accuracy between Chain-of-\nThought and Hypothesis Testing prompting on "Un-\nknown" label.\nFigure 4: Further results on ProofWriter.\ndatasets that were analyzed show a weakness in\nall methods for handling "Unknown" labels. This\nbeacuse the OWA hypothesis necessitates the ex-\nclusion of both positive and negative findings to\nvalidate the "Unknown" label. The advantages of\nour strategy are illustrated by the comparison of the\nmodel output outputs in Figure 2. The content "First\nassume the conclusion is True ... Then assume\nthe conclusion is False ... So ... is Unknown." gen-\nerated by the model through learning Hypothesis\nTesting prompting is more in line with our thinking.\nBesides, we’ll conduct further research and show\nit later.\n4.3.\nFurther Analysis\nWe carry out the following thorough analysis to\nbetter comprehend the thought process:\nProof Accuracy. Five students are required to\nmanually evaluate the outcomes of the intermediate\nreasoning after we randomly picked 100 examples\nfrom depth-5 of the ProofWriter. Proof accuracy rep-\nresents the proportion where the inference process\nhas been proven to be reasonable in the correct\npart of data label prediction. We compare the re-\nsults of Chian-of-Thought and Hypothesis Testing\nprompting and report in Figure 4(a). While Hy-\npothesis Testing prompting mostly produced the\ncorrect intermediate reasoning process when the\npredicted label was correct, CoT only generated\nthe correct chain for 26% of the examples. This\nresult is in line with other research showing that\nLMs rely on spurious correlations when solving log-\nical problems from beginning to end. Additionally,\nour approach can successfully increase reason-\ning’s rationality. In processing the "Unknown" label,\nHypothesis Testing prompting performs noticeably\nbetter than Chain-of-Thought.\n"Unknown" accuracy.\nIn the ProofWriter\ndataset, we separately counted the accuracy of\nthe "Unknown" label shown in Figure 4(b). The\nresults point to a flaw in the Chain-of-Thought strat-\negy’s handling of "Unknown" labels(only 0.3 accu-\nracyies). Contrarily, Hypothesis Testing prompting\nsignificantly increases the reliability of judging this\nlabel (up to 0.65). This further illustrates the value\nof holding various assumptions, as well as the re-\nverse confirmation of conclusions.\n5.\nConclusion\nWe have investigated Hypothesis Testing prompt-\ning as a straightforward and widely applicable tech-\nnique for improving deductive reasoning in large\nlanguage models. Multiple assumptions are made\nduring hypothesis testing, and conclusions are\nreverse-validated to arrive at the one and only ac-\ncurate answer. Through experiments on two logical\nreasoning datasets, we find that Hypothesis Test-\ning prompting allows large language models to con-\nstruct reasoning more reasonably and accurately.\nWe anticipate that additional research on language-\nbased reasoning approaches will be stimulated by\nour novel prompting design strategy.\n6.\nReferences\nAlfred V. Aho and Jeffrey D. Ullman. 1972. The\nTheory of Parsing, Translation and Compiling,\nvolume 1. Prentice-Hall, Englewood Cliffs, NJ.\nAmerican Psychological Association. 1983. Publi-\ncations Manual. American Psychological Associ-\nation, Washington, DC.\nRie Kubota Ando and Tong Zhang. 2005. A frame-\nwork for learning predictive structures from multi-\nple tasks and unlabeled data. Journal of Machine\nLearning Research, 6:1817–1853.\nGalen Andrew and Jianfeng Gao. 2007. Scalable\ntraining of L1-regularized log-linear models. In\nProceedings of the 24th International Conference\non Machine Learning, pages 33–40.\nRebecca Bevans. 2022.\nHypothesis Testing |\nA Step-by-Step Guide with Easy Examples.\nScribbr.\nTom B. Brown, Benjamin Mann, Nick Ryder,\nMelanie Subbiah, Jared Kaplan, Prafulla Dhari-\nwal, Arvind Neelakantan, Pranav Shyam, Girish\nSastry, Amanda Askell, Sandhini Agarwal, Ariel\nHerbert-Voss, Gretchen Krueger, Tom Henighan,\nRewon Child, Aditya Ramesh, Daniel M. Ziegler,\nJeffrey Wu, Clemens Winter, Christopher Hesse,\nMark Chen, Eric Sigler, Mateusz Litwin, Scott\nGray, Benjamin Chess, Jack Clark, Christopher\nBerner, Sam McCandlish, Alec Radford, Ilya\nSutskever, and Dario Amodei. 2020. Language\nmodels are few-shot learners. In NeurIPS.\nBSI. 1973a.\nNatural Fibre Twines, 3rd edition.\nBritish Standards Institution, London. BS 2570.\nBSI. 1973b. Natural fibre twines. BS 2570, British\nStandards Institution, London. 3rd. edn.\nA. Castor and L. E. Pollux. 1992. The use of user\nmodelling to guide inference and learning. Ap-\nplied Intelligence, 2(1):37–53.\nAshok K. Chandra, Dexter C. Kozen, and Larry J.\nStockmeyer. 1981. Alternation. Journal of the As-\nsociation for Computing Machinery, 28(1):114–\n133.\nJ.L. Chercheur. 1994. Case-Based Reasoning, 2nd\nedition. Morgan Kaufman Publishers, San Mateo,\nCA.\nYejin Choi. 2022. The curious case of common-\nsense intelligence. Daedalus.\nN. Chomsky. 1973. Conditions on transformations.\nIn A festschrift for Morris Halle, New York. Holt,\nRinehart & Winston.\nAakanksha Chowdhery, Sharan Narang, Jacob\nDevlin, Maarten Bosma, Gaurav Mishra, Adam\nRoberts, Paul Barham, Hyung Won Chung,\nCharles Sutton, Sebastian Gehrmann, Parker\nSchuh, Kensen Shi, Sasha Tsvyashchenko,\nJoshua Maynez, Abhishek Rao, Parker Barnes,\nYi Tay,\nNoam Shazeer,\nVinodkumar Prab-\nhakaran, Emily Reif, Nan Du, Ben Hutchinson,\nReiner Pope, James Bradbury, Jacob Austin,\nMichael Isard, Guy Gur-Ari, Pengcheng Yin,\nToju Duke, Anselm Levskaya, Sanjay Ghemawat,\nSunipa Dev, Henryk Michalewski, Xavier Gar-\ncia, Vedant Misra, Kevin Robinson, Liam Fe-\ndus, Denny Zhou, Daphne Ippolito, David Luan,\nHyeontaek Lim, Barret Zoph, Alexander Spiri-\ndonov, Ryan Sepassi, David Dohan, Shivani\nAgrawal, Mark Omernick, Andrew M. Dai, Thanu-\nmalayan Sankaranarayana Pillai, Marie Pellat,\nAitor Lewkowycz, Erica Moreira, Rewon Child,\nOleksandr Polozov, Katherine Lee, Zongwei\nZhou, Xuezhi Wang, Brennan Saeta, Mark Diaz,\nOrhan Firat, Michele Catasta, Jason Wei, Kathy\nMeier-Hellstern, Douglas Eck, Jeff Dean, Slav\nPetrov, and Noah Fiedel. 2022. Palm: Scaling\nlanguage modeling with pathways. CoRR.\nPeter Clark, Oyvind Tafjord, and Kyle Richardson.\n2020. Transformers as soft reasoners over lan-\nguage. In IJCAI.\nJames W. Cooley and John W. Tukey. 1965. An\nalgorithm for the machine calculation of complex\nFourier series.\nMathematics of Computation,\n19(90):297–301.\nAntonia Creswell, Murray Shanahan, and Irina Hig-\ngins. 2022. Selection-inference: Exploiting large\nlanguage models for interpretable logical reason-\ning. CoRR.\nYilun Du, Shuang Li, Joshua B. Tenenbaum, and\nIgor Mordatch. 2022. Learning iterative reason-\ning through energy minimization. In ICML.\nUmberto Eco. 1990. The Limits of Interpretation.\nIndian University Press.\nYao Fu, Hao Peng, Ashish Sabharwal, Peter Clark,\nand Tushar Khot. 2022.\nComplexity-based\nprompting for multi-step reasoning. CoRR.\nDan Gusfield. 1997. Algorithms on Strings, Trees\nand Sequences. Cambridge University Press,\nCambridge, UK.\nJulia Hirschberg and Christopher D. Manning. 2015.\nAdvances in natural language processing. Sci-\nence.\nPaul Gerhard Hoel. 1971a. Elementary Statistics,\n3rd edition. Wiley series in probability and math-\nematical statistics. Wiley, New York, Chichester.\nISBN 0 471 40300.\nPaul Gerhard Hoel. 1971b. Elementary Statistics,\n3rd edition, Wiley series in probability and mathe-\nmatical statistics, pages 19–33. Wiley, New York,\nChichester. ISBN 0 471 40300.\nShima Imani, Liang Du, and Harsh Shrivastava.\n2023. Mathprompter: Mathematical reasoning\nusing large language models. CoRR.\nOtto Jespersen. 1922. Language: Its Nature, De-\nvelopment, and Origin. Allen and Unwin.\nPhil Johnson-Laird. 2010. Deductive reasoning.\nWiley Interdisciplinary Reviews: Cognitive Sci-\nence.\nPhilip N Johnson-Laird. 1999. Deductive reasoning.\nAnnual review of psychology.\nJaehun Jung, Lianhui Qin, Sean Welleck, Faeze\nBrahman, Chandra Bhagavatula, Ronan Le Bras,\nand Yejin Choi. 2022. Maieutic prompting: Logi-\ncally consistent reasoning with recursive expla-\nnations. In EMNLP.\nJared Kaplan, Sam McCandlish, Tom Henighan,\nTom B. Brown, Benjamin Chess, Rewon Child,\nScott Gray, Alec Radford, Jeffrey Wu, and Dario\nAmodei. 2020. Scaling laws for neural language\nmodels. CoRR.\nSeyed Mehran Kazemi, Najoung Kim, Deepti Bha-\ntia, Xin Xu, and Deepak Ramachandran. 2022.\nLAMBADA: backward chaining for automated\nreasoning in natural language. CoRR.\nTakeshi Kojima, Shixiang Shane Gu, Machel Reid,\nYutaka Matsuo, and Yusuke Iwasawa. 2022.\nLarge language models are zero-shot reason-\ners. In NeurIPS.\nRosa Patricio S. La, Brooks J. Paul, Deych Elena,\nEdward L. Boone, David J. Edwards, Wang Qin,\nSodergren Erica, Weinstock George, William D.\nShannon, and Ethan P. White. 2012. Hypothe-\nsis testing and power calculations for taxonomic-\nbased human microbiome data. Plos One.\nChia-Hsuan Lee, Hao Cheng, and Mari Osten-\ndorf. 2021. Dialogue state tracking with a lan-\nguage model using schema-driven prompting. In\nEMNLP. Association for Computational Linguis-\ntics.\nHanmeng Liu, Ruoxi Ning, Zhiyang Teng, Jian Liu,\nQiji Zhou, and Yue Zhang. 2023a. Evaluating the\nlogical reasoning ability of chatgpt and GPT-4.\nCoRR.\nHanmeng Liu, Zhiyang Teng, Leyang Cui, Chaoli\nZhang, Qiji Zhou, and Yue Zhang. 2023b. Logi-\ncot: Logical chain-of-thought instruction-tuning\ndata collection with GPT-4. CoRR.\nPengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao\nJiang, Hiroaki Hayashi, and Graham Neubig.\n2023c. Pre-train, prompt, and predict: A sys-\ntematic survey of prompting methods in natural\nlanguage processing. ACM Comput. Surv.\nMohammad Sadegh Rasooli and Joel R. Tetreault.\n2015. Yara parser: A fast and accurate depen-\ndency parser. Computing Research Repository,\narXiv:1503.06733. Version 2.\nRaymond Reiter. 1981. On closed world data bases.\nIn Readings in Artificial Intelligence.\nStuart J. Russell and Peter Norvig. 2010. Artificial\nIntelligence - A Modern Approach, Third Interna-\ntional Edition. Pearson Education.\nAbulhair Saparov and He He. 2022. Language mod-\nels are greedy reasoners: A systematic formal\nanalysis of chain-of-thought. CoRR.\nCharles Joseph Singer, E. J. Holmyard, and A. R.\nHall, editors. 1954–58. A history of technology.\nOxford University Press, London. 5 vol.\nJannik Strötgen and Michael Gertz. 2012. Temporal\ntagging on different domains: Challenges, strate-\ngies, and gold standards. In Proceedings of the\nEight International Conference on Language Re-\nsources and Evaluation (LREC’12), pages 3746–\n3753, Istanbul, Turkey. European Language Re-\nsource Association (ELRA).\nS. Superman, B. Batman, C. Catwoman, and S. Spi-\nderman. 2000. Superheroes experiences with\nbooks, 20th edition. The Phantom Editors Asso-\nciates, Gotham City.\nOyvind Tafjord, Bhavana Dalvi, and Peter Clark.\n2021.\nProofwriter:\nGenerating implications,\nproofs, and abductive statements over natural\nlanguage. In Findings of ACL.\nRomal Thoppilan, Daniel De Freitas, Jamie Hall,\nNoam Shazeer, Apoorv Kulshreshtha, Heng-\nTze Cheng, Alicia Jin, Taylor Bos, Leslie\nBaker, Yu Du, YaGuang Li, Hongrae Lee,\nHuaixiu Steven Zheng, Amin Ghafouri, Marcelo\nMenegali, Yanping Huang, Maxim Krikun, Dmitry\nLepikhin, James Qin, Dehao Chen, Yuanzhong\nXu, Zhifeng Chen, Adam Roberts, Maarten\nBosma, Yanqi Zhou, Chung-Ching Chang, Igor\nKrivokon, Will Rusch, Marc Pickett, Kathleen S.\nMeier-Hellstern, Meredith Ringel Morris, Tulsee\nDoshi, Renelito Delos Santos, Toju Duke, Johnny\nSoraker, Ben Zevenbergen, Vinodkumar Prab-\nhakaran, Mark Diaz, Ben Hutchinson, Kristen Ol-\nson, Alejandra Molina, Erin Hoffman-John, Josh\nLee, Lora Aroyo, Ravi Rajakumar, Alena Butryna,\nMatthew Lamm, Viktoriya Kuzmina, Joe Fenton,\nAaron Cohen, Rachel Bernstein, Ray Kurzweil,\nBlaise Aguera-Arcas, Claire Cui, Marian Croak,\nEd H. Chi, and Quoc Le. 2022. Lamda: Lan-\nguage models for dialog applications. CoRR.\nXuezhi Wang, Jason Wei, Dale Schuurmans,\nQuoc V. Le, Ed H. Chi, and Denny Zhou. 2022.\nSelf-consistency improves chain of thought rea-\nsoning in language models. CoRR.\nJason Wei, Yi Tay, Rishi Bommasani, Colin Raf-\nfel, Barret Zoph, Sebastian Borgeaud, Dani Yo-\ngatama, Maarten Bosma, Denny Zhou, Donald\nMetzler, Ed H. Chi, Tatsunori Hashimoto, Oriol\nVinyals, Percy Liang, Jeff Dean, and William Fe-\ndus. 2022a. Emergent abilities of large language\nmodels. Trans. Mach. Learn. Res.\nJason Wei, Xuezhi Wang, Dale Schuurmans,\nMaarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi,\nQuoc V. Le, and Denny Zhou. 2022b. Chain-\nof-thought prompting elicits reasoning in large\nlanguage models. In NeurIPS.\nZhiheng Xi, Senjie Jin, Yuhao Zhou, Rui Zheng,\nSongyang Gao, Tao Gui, Qi Zhang, and Xuanjing\nHuang. 2023. Self-polish: Enhance reasoning in\nlarge language models via problem refinement.\nJingfeng Yang, Hongye Jin, Ruixiang Tang, Xiao-\ntian Han, Qizhang Feng, Haoming Jiang, Bing\nYin, and Xia Hu. 2023. Harnessing the power of\nllms in practice: A survey on chatgpt and beyond.\nCoRR.\nEric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D.\nGoodman. 2022. Star: Bootstrapping reasoning\nwith reasoning. In NeurIPS.\nChuanyang Zheng, Zhengying Liu, Enze Xie, Zhen-\nguo Li, and Yu Li. 2023. Progressive-hint prompt-\ning improves reasoning in large language mod-\nels.\nDenny Zhou, Nathanael Schärli, Le Hou, Jason\nWei, Nathan Scales, Xuezhi Wang, Dale Schuur-\nmans, Olivier Bousquet, Quoc Le, and Ed H. Chi.\n2022. Least-to-most prompting enables complex\nreasoning in large language models. CoRR.\n', metadata={'Published': '2024-05-09', 'Title': 'Hypothesis Testing Prompting Improves Deductive Reasoning in Large Language Models', 'Authors': 'Yitian Li, Jidong Tian, Hao He, Yaohui Jin', 'Summary': 'Combining different forms of prompts with pre-trained large language models\nhas yielded remarkable results on reasoning tasks (e.g. Chain-of-Thought\nprompting). However, along with testing on more complex reasoning, these\nmethods also expose problems such as invalid reasoning and fictional reasoning\npaths. In this paper, we develop \\textit{Hypothesis Testing Prompting}, which\nadds conclusion assumptions, backward reasoning, and fact verification during\nintermediate reasoning steps. \\textit{Hypothesis Testing prompting} involves\nmultiple assumptions and reverses validation of conclusions leading to its\nunique correct answer. Experiments on two challenging deductive reasoning\ndatasets ProofWriter and RuleTaker show that hypothesis testing prompting not\nonly significantly improves the effect, but also generates a more reasonable\nand standardized reasoning process.'})
```

```python
print(docs[0].metadata)
```

```output
{'Published': '2024-05-09', 'Title': 'Hypothesis Testing Prompting Improves Deductive Reasoning in Large Language Models', 'Authors': 'Yitian Li, Jidong Tian, Hao He, Yaohui Jin', 'Summary': 'Combining different forms of prompts with pre-trained large language models\nhas yielded remarkable results on reasoning tasks (e.g. Chain-of-Thought\nprompting). However, along with testing on more complex reasoning, these\nmethods also expose problems such as invalid reasoning and fictional reasoning\npaths. In this paper, we develop \\textit{Hypothesis Testing Prompting}, which\nadds conclusion assumptions, backward reasoning, and fact verification during\nintermediate reasoning steps. \\textit{Hypothesis Testing prompting} involves\nmultiple assumptions and reverses validation of conclusions leading to its\nunique correct answer. Experiments on two challenging deductive reasoning\ndatasets ProofWriter and RuleTaker show that hypothesis testing prompting not\nonly significantly improves the effect, but also generates a more reasonable\nand standardized reasoning process.'}
```

## Lazy Load

If we're loading a  large number of Documents and our downstream operations can be done over subsets of all loaded Documents, we can lazily load our Documents one at a time to minimize our memory footprint:

```python
docs = []

for doc in loader.lazy_load():
    docs.append(doc)

    if len(docs) >= 10:
        # do some paged operation, e.g.
        # index.upsert(doc)

        docs = []
```

In this example we never have more than 10 Documents loaded into memory at a time.

## Use papers summaries as documents

You can use summaries of Arvix paper as documents rather than raw papers:

```python
docs = loader.get_summaries_as_docs()
docs[0]
```

```output
Document(page_content='Combining different forms of prompts with pre-trained large language models\nhas yielded remarkable results on reasoning tasks (e.g. Chain-of-Thought\nprompting). However, along with testing on more complex reasoning, these\nmethods also expose problems such as invalid reasoning and fictional reasoning\npaths. In this paper, we develop \\textit{Hypothesis Testing Prompting}, which\nadds conclusion assumptions, backward reasoning, and fact verification during\nintermediate reasoning steps. \\textit{Hypothesis Testing prompting} involves\nmultiple assumptions and reverses validation of conclusions leading to its\nunique correct answer. Experiments on two challenging deductive reasoning\ndatasets ProofWriter and RuleTaker show that hypothesis testing prompting not\nonly significantly improves the effect, but also generates a more reasonable\nand standardized reasoning process.', metadata={'Entry ID': 'http://arxiv.org/abs/2405.06707v1', 'Published': datetime.date(2024, 5, 9), 'Title': 'Hypothesis Testing Prompting Improves Deductive Reasoning in Large Language Models', 'Authors': 'Yitian Li, Jidong Tian, Hao He, Yaohui Jin'})
```

## API reference

For detailed documentation of all ArxivLoader features and configurations head to the API reference: [python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.arxiv.ArxivLoader.html#langchain_community.document_loaders.arxiv.ArxivLoader](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.arxiv.ArxivLoader.html#langchain_community.document_loaders.arxiv.ArxivLoader)
